Lineage for IBM products and Oracle

Data Lineage

Data lineage is the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.

During my work at ING-DiBa, I implemented a system that extended the capabilities of IBM products.

Covered products:

  • IBM DataStage 11.7 + IBM InfoSphere Information Governance Catalog (IGC)
  • IBM Cloud Pak for Data + IBM Watson Knowledge Catalog

Involved products:

Oracle databases, Atlassian Confluence, Microsoft Azure, Amazon S3, CyberArk, LDAP/Kerberos, Automic Automation (UC4)

Project target groups

Architects, developers, DBAs, business analysts, data cleansing/compliance and GDPR teams

Lineage

I developed the lineage system, providing complete control over all enterprise data flows. Using ANTLR, I built a lexer/parser stack that analyzes all ETL/ELT processes as well as all related Oracle SQL queries (of any complexity). The lineage comprises a complete data flow graph and links the technical level to the business level. The technical part supports detailed analysis of data flows and covers both the level of tables and ETL/ELT processes and the level of individual fields and variables. Enterprise-specific data flow connections (special lineage cases) can also be processed. The system includes a service that automatically rebuilds the graph whenever ETL/ELT processes or database structures change.
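
Below is a minimal sketch, in Java, of how such a lineage graph can be modeled; all class and method names (LineageGraph, Node, downstream, the sample job name) are illustrative assumptions, not the actual system's API. The traversal is the kind of reachability query that underlies downstream impact analysis:

    import java.util.*;

    public class LineageGraph {
        public enum NodeType { TABLE, COLUMN }

        // Nodes exist at table and column granularity; edges record
        // which ETL/ELT job moves data between them.
        public record Node(NodeType type, String qualifiedName) {}
        public record Edge(Node source, Node target, String etlJob) {}

        private final Map<Node, List<Edge>> outgoing = new HashMap<>();

        public void addEdge(Node source, Node target, String etlJob) {
            outgoing.computeIfAbsent(source, k -> new ArrayList<>())
                    .add(new Edge(source, target, etlJob));
        }

        // Downstream impact analysis: every node reachable from `start`.
        public Set<Node> downstream(Node start) {
            Set<Node> seen = new LinkedHashSet<>();
            Deque<Node> queue = new ArrayDeque<>(List.of(start));
            while (!queue.isEmpty()) {
                for (Edge e : outgoing.getOrDefault(queue.poll(), List.of())) {
                    if (seen.add(e.target())) queue.add(e.target());
                }
            }
            return seen;
        }

        public static void main(String[] args) {
            LineageGraph g = new LineageGraph();
            Node src = new Node(NodeType.COLUMN, "STAGE.CUSTOMER.CUST_ID");
            Node tgt = new Node(NodeType.COLUMN, "DWH.DIM_CUSTOMER.CUST_ID");
            g.addEdge(src, tgt, "LOAD_DIM_CUSTOMER");
            System.out.println(g.downstream(src));
        }
    }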

Lineage GUI:

  • interactive visualization of data flow graphs;

  • reporting;

  • statistics;

  • convenient navigation between graph nodes;

  • filtering of the graph by specified criteria (see the sketch below).
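
As a small illustration of criterion-based filtering, the sketch below restricts a set of lineage edges to those whose endpoints satisfy a predicate; the edge representation and names are again illustrative assumptions, not the GUI's real API:

    import java.util.List;
    import java.util.function.Predicate;

    public class GraphFilter {
        public record Edge(String source, String target) {}

        // Keep only edges whose endpoints both satisfy the criterion.
        public static List<Edge> filter(List<Edge> edges, Predicate<String> keepNode) {
            return edges.stream()
                    .filter(e -> keepNode.test(e.source()) && keepNode.test(e.target()))
                    .toList();
        }

        public static void main(String[] args) {
            List<Edge> edges = List.of(
                    new Edge("STAGE.ORDERS", "DWH.F_ORDERS"),
                    new Edge("DWH.F_ORDERS", "MART.ORDER_REPORT"));
            // Example criterion: hide the staging layer from the rendered graph.
            System.out.println(filter(edges, n -> !n.startsWith("STAGE.")));
        }
    }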

Generating/Adjusting

The parsing algorithms also serve as the basis for generating new ETL/ELT processes. The generator can produce ETL/ELT processes of almost any complexity, taking into account data historicization patterns such as SCD1, SCD2 and Data Vault 2.0, as well as free-form designs. The system also includes mechanisms for batch configuration of multiple ETL/ELT processes and for preparing them for deployment.
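
To illustrate the idea of template-based generation, here is a hedged sketch that emits the "close old version" step of an SCD2 load for Oracle from a few metadata inputs. The real generator produces complete ETL/ELT jobs; the table, column and method names here are assumptions:

    public class Scd2Generator {
        // Emit the SCD2 step that closes current rows whose tracked attributes
        // changed in the source; the valid_to/is_current columns are assumed.
        public static String closeStep(String target, String source,
                                       String key, String[] tracked) {
            StringBuilder changed = new StringBuilder();
            for (int i = 0; i < tracked.length; i++) {
                if (i > 0) changed.append(" OR ");
                changed.append("t.").append(tracked[i]).append(" <> s.").append(tracked[i]);
            }
            return "UPDATE " + target + " t\n"
                 + "SET t.valid_to = SYSDATE, t.is_current = 'N'\n"
                 + "WHERE t.is_current = 'Y'\n"
                 + "  AND EXISTS (SELECT 1 FROM " + source + " s\n"
                 + "              WHERE s." + key + " = t." + key + "\n"
                 + "                AND (" + changed + "))";
        }

        public static void main(String[] args) {
            System.out.println(closeStep("DWH.DIM_CUSTOMER", "STAGE.CUSTOMER",
                    "CUST_ID", new String[] {"NAME", "CITY"}));
        }
    }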

Data quality (DQ)

The module automates DQ processes:

  • setting requirements for data verification;

  • generation of ETL/ELT processes that perform DQ checks (see the sketch after this list);

  • registration of verification rules;

  • automatic creation and updating of the accompanying documentation.
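
As a sketch of how a registered verification rule can be turned into a generated DQ check (the rule fields, table names and SQL shape are illustrative assumptions, not the module's actual format):

    import java.util.List;

    public class DqRule {
        // A registered rule: an identifier, the table it guards, and the
        // condition every row must satisfy.
        public record Rule(String id, String table, String condition) {}

        // Generate a check query that counts rule violations.
        public static String toCheckSql(Rule r) {
            return "SELECT '" + r.id() + "' AS rule_id, COUNT(*) AS violations\n"
                 + "FROM " + r.table() + "\n"
                 + "WHERE NOT (" + r.condition() + ")";
        }

        public static void main(String[] args) {
            List<Rule> rules = List.of(
                    new Rule("CUST_001", "DWH.DIM_CUSTOMER", "CUST_ID IS NOT NULL"),
                    new Rule("CUST_002", "DWH.DIM_CUSTOMER", "LENGTH(IBAN) = 22"));
            rules.forEach(r -> System.out.println(toCheckSql(r) + ";\n"));
        }
    }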

Versioning

In addition, the system includes the following modules:

  • automatic versioning of ETL/ELT processes and all database objects (see the sketch after this list)

  • GUI for easy recovery of ETL/ELT processes

  • side-by-side visual comparison of ETL/ELT process revisions in human-readable form

  • batch comparison of multiple ETL/ELT processes

  • global search
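
For the automatic versioning of database objects, one possible mechanism is sketched below: fetching an object's DDL via Oracle's DBMS_METADATA package over JDBC and storing a new revision only when its hash changes. The DDL_VERSIONS table and method names are illustrative assumptions:

    import java.security.MessageDigest;
    import java.sql.*;
    import java.util.HexFormat;

    public class DdlVersioner {
        // Capture the current DDL of one object and store it as a new revision
        // unless exactly this DDL text (by hash) was already captured.
        public static void snapshot(Connection con, String type,
                                    String owner, String name) throws Exception {
            String ddl;
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT DBMS_METADATA.GET_DDL(?, ?, ?) FROM DUAL")) {
                ps.setString(1, type);   // e.g. "TABLE", "VIEW", "PACKAGE"
                ps.setString(2, name);
                ps.setString(3, owner);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    ddl = rs.getString(1);
                }
            }
            String hash = HexFormat.of().format(
                    MessageDigest.getInstance("SHA-256").digest(ddl.getBytes()));
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO DDL_VERSIONS (OBJ_OWNER, OBJ_NAME, DDL_TEXT, DDL_HASH, CAPTURED_AT) "
                  + "SELECT ?, ?, ?, ?, SYSTIMESTAMP FROM DUAL WHERE NOT EXISTS "
                  + "(SELECT 1 FROM DDL_VERSIONS WHERE OBJ_OWNER = ? AND OBJ_NAME = ? AND DDL_HASH = ?)")) {
                ps.setString(1, owner); ps.setString(2, name);
                ps.setString(3, ddl);   ps.setString(4, hash);
                ps.setString(5, owner); ps.setString(6, name); ps.setString(7, hash);
                ps.executeUpdate();
            }
        }
    }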

Technically, the system is a client-server solution that communicates both over direct HTTP connections and via WebSockets.
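
As a minimal sketch of the HTTP side of such a setup, using only the JDK's built-in HttpServer: the GUI could fetch lineage data from an endpoint like this, while WebSockets would carry live graph updates. The endpoint path and payload are illustrative assumptions:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class LineageServer {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            // The client asks for the (filtered) graph and receives JSON.
            server.createContext("/lineage/graph", exchange -> {
                byte[] body = "{\"nodes\":[],\"edges\":[]}"
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }
    }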

After development, I transferred the knowledge to the team's developers for subsequent project support.